Gentle Masking of Low-Complexity Sequences Improves Homology Search

Author

Martin C. Frith

Abstract

Detection of sequences that are homologous, i.e. descended from a common ancestor, is a fundamental task in computational biology. This task is confounded by low-complexity tracts (such as atatatatatat), which arise frequently and independently, causing strong similarities that are not homologies. There has been much research on identifying low-complexity tracts, but little research on how to treat them during homology search. We propose to find homologies by aligning sequences with "gentle" masking of low-complexity tracts. Gentle masking means that the match score involving a masked letter is min(0,S), where S is the unmasked score. Gentle masking slightly but noticeably improves the sensitivity of homology search (compared to "harsh" masking), without harming specificity. We show examples in three useful homology search problems: detection of NUMTs (nuclear copies of mitochondrial DNA), recruitment of metagenomic DNA reads to reference genomes, and pseudogene detection. Gentle masking is currently the best way to treat low-complexity tracts during homology search.
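The scoring rule stated in the abstract, min(0, S) for any match involving a masked letter, can be illustrated with a minimal sketch. The conventions here are assumptions made for illustration only and do not come from the abstract: masked letters are written in lowercase, the unmasked score S is a simple +1/-1 match/mismatch scheme, and the function names `base_score`, `gentle_score`, and `ungapped_score` are hypothetical.

```python
def base_score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Unmasked score S for aligning letters a and b (assumed +1/-1 scheme)."""
    return match if a.upper() == b.upper() else mismatch


def gentle_score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Gently masked score: min(0, S) if either letter is masked.

    Assumption: lowercase letters mark masked (low-complexity) positions.
    """
    s = base_score(a, b, match, mismatch)
    if a.islower() or b.islower():
        return min(0, s)  # masked matches earn nothing; mismatches still cost
    return s


def ungapped_score(x: str, y: str, score_fn=gentle_score) -> int:
    """Sum of per-column scores for two equal-length, gap-free sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(score_fn(a, b) for a, b in zip(x, y))


if __name__ == "__main__":
    query = "ACGTatatat"   # the lowercase tract is treated as masked
    target = "ACGTatatat"
    print(ungapped_score(query, target, base_score))    # unmasked scoring
    print(ungapped_score(query, target, gentle_score))  # gentle masking
```

Under these assumptions, the masked tract contributes nothing to the alignment score when it matches, so it cannot create a strong similarity on its own, while an alignment anchored in unmasked sequence can still extend through it.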


Similar resources

A new algorithm for detecting low-complexity regions in protein sequences

MOTIVATION Pair-wise alignment of protein sequences and local similarity searches produce many false positives because of compositionally biased regions, also called low-complexity regions (LCRs), of amino acid residues. Masking and filtering such regions significantly improves the reliability of homology searches and, consequently, functional predictions. Most of the available algorithms are b...


Compact Encoding Strategies for DNA Sequence Similarity Search

Determining whether two DNA sequences are similar is an essential component of DNA sequence analysis. Dynamic programming is the algorithm of choice if computational time is not the most important consideration. Heuristic search tools, such as BLAST, are computationally more efficient, but they may miss some of the sequence similarities (Altschul et al., 1990). These tools often use common k-tu...


A new repeat-masking method enables specific detection of homologous sequences

Biological sequences are often analyzed by detecting homologous regions between them. Homology search is confounded by simple repeats, which give rise to strong similarities that are not homologies. Standard repeat-masking methods fail to eliminate this problem, and they are especially ill-suited to AT-rich DNA such as malaria and slime-mould genomes. We present a new repeat-masking method, TAN...


Exact sequences of extended $d$-homology

In this article, we show the existence of certain exact sequences with respect to two homology theories, called d-homology and extended d-homology. We present sufficient conditions for the existence of long exact extended d-homology sequence. Also we give some illustrative examples.


Fast and Complete Search of siRNA Off-target Sequences

Smith-Waterman alignment algorithm is favored in search for siRNA off-target instead of the BLAST algorithm, because BLAST tends to overlook some significant homologous sequences, especially when they are short (21 nt~27 nt). Smith-Waterman algorithm, however, suffers from its own shortcomings, especially its inefficiency in searching through a large sequence database. This paper presents a two...



Volume 6

Publication date: 2011
